Navigating the E. coli functional landscape

Author

Daniel Martinez-Martinez, Filipe Cabreiro

Introduction

The partition of the existing bacteria into biological units have been debated over time, leaving us with a working framework where bacterial species can be divided based based on their genetic content. However, bacterial species are non-stationary groups that are in constant evolution, dynamically gaining and losing genetic material while keeping a core of functions unchanged even across large geographical distances. Moreover, the functional content of a given pair of strains can largely differ due to their evolutionary history, meaning that they can have distinct ways to operate with their environment. Besides, the functional characterization of these strains remains an open challenge due to their, sometimes, extremely large diversity and lack of experimental validation. This project is meant to explore the functional landscape of the Escharichia coli pangenome, one of the most sequenced bacterial species at this moment.

Results

Filtering strains to be analysed

Contigs and Scaffold status

Code
library(tidyverse)
library(readr)
library(readxl)
library(cowplot)
library(viridis)
library(patchwork)

theme_set(theme_cowplot(14))

Strain information has been downloaded from NCBI genomes. The metadata was downloaded on January 11th 2024. To reduce the number of E. coli genomes, the assembly level allowed was only for the types of “chromosome”, “complete genome” or “scaffold”. We can have a look at parameters such as the contig_n50/scaffold_n50 to see how the distribution of contigs looks like in the full cohort. We can see two different peaks, one near 0 that represents all the assemblies that have multiple contigs, and another around 5 megabases

Code
metadata = read_xlsx('ecoli_ncbi_metadata.xlsx')

metadata = metadata %>% 
  distinct(assembly_name, .keep_all = T)

metadata = metadata %>% 
  # remove the string after the dot in assembly_id
  mutate(assembly_id_simp = str_remove(assembly_id, "\\..*"),
         .before = "assembly_name")

metadata
Code
metadata %>% 
  ggplot(aes(contig_n50)) +
  geom_histogram(binwidth = 10000) +
  scale_x_continuous(labels = scales::comma) +
  labs(x = "Contig N50", y = "Number of genomes") +
  theme_cowplot(14) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

Code
# plot distribution of scaffold_n50
metadata %>% 
  ggplot(aes(scaffold_n50)) +
  geom_histogram(binwidth = 10000) +
  scale_x_continuous(labels = scales::comma) +
  labs(x = "Scaffold N50", y = "Number of genomes") +
  theme_cowplot(14) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

It is even more evident that there are huge differences in terms of the scaffold N50 parameter in respect to their assembly level. As expected, we have a greater count of scaffold at chromosome or complete genome level compared to scaffold. The assemblies from the first two categories will be included, and now we need to include a threshold for the last category to remove strains with low scaffold N50.

Code
# plot scaffold distribution by assembly level with violin plot
metadata %>% 
  ggplot(aes(assembly_level, scaffold_n50, fill = assembly_level)) +
  geom_violin(show.legend = F) +
  # plot mean and sd as pointrange
  stat_summary(fun.data = mean_sdl, geom = "pointrange", 
               fun.args = list(mult = 1), color = "black",
               show.legend = F) +
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Assembly level", y = "Scaffold N50") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_viridis(discrete = T, begin = 0.2, end = 0.8) +
  theme_cowplot(14)

When we filter out the complete categories and then show the distribution of the scaffold N50, we have this distribution:

Code
# show scaffold distribution of assembly_level == scaffold
p1 = metadata %>% 
  filter(assembly_level == 'Scaffold') %>% 
  ggplot(aes(scaffold_n50)) +
  geom_histogram(binwidth = 10000) +
  scale_x_continuous(labels = scales::comma) +
  labs(x = "Scaffold N50", y = "Number of genomes") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme_cowplot(14)

p2 = metadata %>% 
  filter(assembly_level == 'Scaffold') %>%
  filter(scaffold_n50 < 500000) %>% 
  ggplot(aes(scaffold_n50)) +
  geom_histogram(binwidth = 10000) +
  scale_x_continuous(labels = scales::comma) +
  labs(x = "Scaffold N50", y = "Number of genomes") +
  theme_cowplot(14) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 


# show them with patchwork
p1 + p2

Left: Distribution of scaffold N50 in the full cohort. Right: Distribution of scaffold N50 in the filtered cohort with a threshold of 150K.
Code
metadata_filt = metadata %>% filter(scaffold_n50 > 100000)
dims = dim(metadata_filt)
dims[1]
[1] 11860

From the distribution of the scaffold N50 values, we can safely chose a threshold of ~ 150K as that will leave out E. coli assemblies that have a high amount of contigs/scaffolds.

In summary: we are choosing a threshold of 150K for the scaffold_n50 parameter from now on. Anything below that will be discarded.

After this filtering step we have 11860 genomes left.

CheckM filters

We need to filter the bad quality genomes from the cohort, which we can start doing by filtering out by checkM metricsL:

  • checkM completeness: the value that is usually applied ranges from 70%, but I’ll be using a more strict value of 95% because I want to have full genomes when possible.

  • checkM contamination: I am applying a value of 1% max contamination to avoid getting false information regarding the gene content.

Code
metadata_filt %>% 
  mutate(checkM_completeness_class = 
           case_when(is.na(checkM_completeness) ~ 'NA',
                     checkM_completeness < 95 ~ "< 95%",
                     checkM_completeness >= 95 ~ ">= 95%")) %>% 
  count(checkM_completeness_class) %>% print
# A tibble: 2 × 2
  checkM_completeness_class     n
  <chr>                     <int>
1 < 95%                       122
2 >= 95%                    11738
Code
metadata_filt %>% 
  mutate(checkM_contamination_class = 
           case_when(is.na(checkM_contamination) ~ 'NA',
                     checkM_contamination > 1 ~ "> 1%",
                     checkM_contamination <= 1 ~ "<= 1%")) %>% 
  count(checkM_contamination_class) %>% print
# A tibble: 2 × 2
  checkM_contamination_class     n
  <chr>                      <int>
1 <= 1%                       9421
2 > 1%                        2439
Code
metadata_filt = metadata_filt %>% 
  filter(checkM_completeness > 95,
         checkM_contamination < 1) %>% 
  filter(seq_length < 7000000)

Finally, after calculating the number of contigs per assembly, I have seen that some of them have a high number of contigs, which is not a good sign for the quality of the assembly. I have decided to filter out genomes with more than 300 contigs.

Code
metadata_filt = read_csv("tables/ecoli_ncbi_metadata_filt.csv")

p1 = metadata_filt %>% 
  # filter(Contig_number < 500) %>% 
  ggplot(aes(Contig_number)) +
  geom_histogram(binwidth = 1) +
  scale_x_continuous(labels = scales::comma) +
  labs(x = "Number of contigs", y = "Number of genomes") +
  theme_cowplot(14)

p2 = metadata_filt %>% 
  filter(Contig_number < 300) %>% 
  ggplot(aes(Contig_number)) +
  geom_histogram(binwidth = 1) +
  scale_x_continuous(labels = scales::comma) +
  labs(x = "Number of contigs", y = "Number of genomes") +
  theme_cowplot(14)

p1 + p2

Left: Distribution of contig number in the filtered cohort. Right: Distribution of contig number in the filtered cohort with a threshold of 300 contigs.
Code
metadata_filt = metadata_filt %>% 
  filter(Contig_number < 300)

Another filter that I have applied is using the phylogroup characterization of the strains with the tool EzClermont1. As we can see in the next plot with the phylogroup distributions, we have some that failed or that belong to cryptic groups. I have discarded these.

Code
phylogroups = read_tsv("tables/phylogroups_ezclermont.tsv",
         col_names = F) %>% 
  rename(genome_id = `X1`,
         phylogroup = `X2`) %>% 
  mutate(assembly_id = str_sub(genome_id, 1, 15))

phylogroups %>% 
  ggplot(aes(phylogroup, fill = phylogroup)) +
  geom_bar(show.legend = F) +
  labs(x = "Phylogroup", y = "Number of genomes") +
  theme_cowplot(14) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Code
library(tidyverse)
metadata_final = read_csv("tables/MAIN_ecoli_NCBI_metadata.csv") 

dim_filt = metadata_final %>% dim

Summarising all the filtering steps I have performed:

  1. filtered out genomes with a scaffold N50 lower than 150K.

  2. filtered out genomes with a checkM completeness lower than 95%.

  3. filtered out genomes with a checkM contamination higher than 1%.

  4. filtered out genomes with a sequence length greater than 7Mbp.

  5. filtered out genomes with a contig number higher than 300.

  6. filtered out genomes that had a higher mash difference than 0.05.

  7. filtered out genomes that were not classified as one of the main E. coli phylogroups

Lab strains

In addition to the strains from the NCBI database, I have included the strains we are using in our lab. These strains are not part of the NCBI database, so they are not included in the filtering steps. In total we are adding 729 strains that belong also to the normal phylogroups as before.

Final number of strains

In summary, if we apply all filters, we are left with 9558 genomes.

Filtering strains to be analysed

The distribution of the different phylogroups represented in the E. coli cohort follow a similar trend as in other research publications2. The phylogroup B1 is the most represented, followed by phylogroup A and B2. The rest of the phylogroups are less represented, with phylogroup U being the least represented.

Code
phylo_colors = c("A" = "#1b9e77",
                 "B1" = "#d95f02",
                 "B2" = "#7570b3",
                 "D" = "#e7298a",
                 "E" = "#66a61e",
                 "F" = "#e6ab02",
                 "G" = "#a6761d",
                 "C" = "#3A254D",
                 "U" = "#666666")

metadata_final %>% 
  ggplot(aes(phylogroup, fill = phylogroup)) +
  geom_bar(show.legend = F) +
  labs(x = "Phylogroup", y = "Number of genomes") +
  scale_fill_manual(values = phylo_colors) +
  theme_cowplot(14) 

Methods

E. coli strain selection

The E. coli strains were selected from the NCBI genome database. The metadata was downloaded on January 11th 2024. The strains were filtered based on the following criteria: genomes that were not at the assembly level of “chromosome”, “complete genome” or “scaffold” were removed. The scaffold N50 was used to filter out genomes with a value lower than 150K. The checkM metrics were used to filter out genomes with a completeness lower than 95% and a contamination higher than 1%. Mash distances were calculated pairwise between all genomes, and those distances that were higher than 0.05 were removed. Finally, genomes with a sequence length greater than 7Mbp, and/or genomes with a contig count higher than 300 were removed. Genomes were dowloaded with the NCBI dataset download tool (v. 16.2.0). The final number of genomes was 9558.

Gene annotation and pangenome analysis

The gene annotation was performed with Bakta (v. 1.9.3) using the full database (v. 5.1) and by default parameters. E. coli phylogroups were determined with EzClermont (v. 0.7.0)1 based on the approach from CermonTyping3. Genomes that did not classify into the main E. coli phylogroups or that were cryptic were discarded for following analyses.

Function prediction

Proteinfer

Reference genes from panaroo were split into 4 smaller files to fit in memory. Proteinfer4 code was downloaded from github and function prediction was done by using 5 ensemble models and a reporting threshold of 0.3

Protein embeddings with ProtT5-XL-U50

The reference genes from panaroo was translated into proteins and sequences were clustered with CD-HIT v. 4.8.1 (similarity threshold of 0.98) to remove similar sequences from the dataset, resulting in 55942 unique clusters from the 57219 original sequences. The resulting file was split into 20 smaller files to fit into memory. Proteins were embedded with bio-embeddings pipeline (v.0.2.2), by using the model ProtT5-XL-U505 in half-precision mode. Proteins larger than 3000 amino acids were discarded to fit in memory. Transfer learning was done using available pipelines under bio-embeddings that used goPredSim6, using euclidean distances and a k_nearest_neighbours of 3. Prot T5 h5 file was used as a reference with GOA annotations from 2022.

Functional prediction with proteinfer, protein embeddings and transfer learning was carried in a computer with an AMD Ryzen 7 7800X3D processor, 32 GB or RAM, and an RTX 4080 with 16 GB of memory, in the WSL2 platform.

Phylogeny

From the core genome from panaroo, a subset of 150 genes, present in all strains, was extracted to build the phylogenetic tree of the pangenome. Before aligning each gene per separate, as some of them were in multicopy, only one gene per genome was kept for alignment. The alignment was done using mafft7 (v. 7. 526), and the tree was constructed with IQ-Tree (v. 2.3.6)8 with the GTR+I+G substitution model.

Software used and versions

  • R version 4.3.2

  • NCBI dataset download tool v. 16.2.0

  • Bakta v. 1.9.3

  • Panaroo v. 1.5.0

  • mash v. 1.1

  • EzClermont v. 0.7.0

  • PPanGGOLiN v. 2.0.5

  • interproscan v. 5.59-91.0

  • CD-HIT v. 4.8.1

  • bio-embeddings v. 0.2.2

  • IQ-tree v. 2.3.6

  • mafft v. 7.526

References

1.
Waters, N. R., Abram, F., Brennan, F., Holmes, A. & Pritchard, L. Easy phylotyping of escherichia coli via the EzClermont web app and command-line tool. Access Microbiology 2, (2020).
2.
3.
Beghain, J., Bridier-Nahmias, A., Le Nagard, H., Denamur, E. & Clermont, O. ClermonTyping: An easy-to-use and accurate in silico method for escherichia genus strain phylotyping. Microbial Genomics 4, (2018).
4.
Sanderson, T., Bileschi, M. L., Belanger, D. & Colwell, L. J. ProteInfer, deep neural networks for protein functional inference. eLife 12, (2023).
5.
Elnaggar, A. et al. ProtTrans: Towards cracking the language of lifes code through self-supervised deep learning and high performance computing. IEEE Transactions on Pattern Analysis and Machine Intelligence 1–1 (2021) doi:10.1109/TPAMI.2021.3095381.
6.
Littmann, M., Heinzinger, M., Dallago, C., Olenyi, T. & Rost, B. Embeddings from deep learning transfer GO annotations beyond homology. Scientific Reports 11, (2021).
7.
Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Molecular Biology and Evolution 30, 772–780 (2013).
8.
Nguyen, L.-T., Schmidt, H. A., Haeseler, A. von & Minh, B. Q. IQ-TREE: A fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Molecular Biology and Evolution 32, 268–274 (2014).